-
-
Notifications
You must be signed in to change notification settings - Fork 17
Add versification warnings for invalid chapter or verse numbers in USFM #381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #381 +/- ##
==========================================
+ Coverage 72.75% 72.81% +0.06%
==========================================
Files 424 424
Lines 36153 36213 +60
Branches 4991 4996 +5
==========================================
+ Hits 26302 26368 +66
+ Misses 8750 8744 -6
Partials 1101 1101 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
39cf6ef to
5ca0162
Compare
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The warning messages are a bit awkward. How difficult would it be to generate messages that are more appropriate for the new warnings?
@ddaspit reviewed 2 files and all commit messages, and made 1 comment.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).
pmachapman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The warning messages are a bit awkward. How difficult would it be to generate messages that are more appropriate for the new warnings?
Good idea. That requires a modification to Serval - see sillsdev/serval/#871
@pmachapman made 1 comment.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @Enkidu93).
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Enkidu93 reviewed 2 files and all commit messages, and made 5 comments.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @pmachapman).
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):
} public UsfmVersificationError(
Rather than adding another constructor, is there way to just check the error type in CheckError() itself?
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 94 at r1 (raw file):
return true; } if (!_verseRef.Value.Valid)
I'm surprised that these kinds of errors (especially the bad verse numbers) aren't captured in the verse ref's valid status somehow 🤔. It is what it is, but it'd certainly be nice if it were haha.
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):
Type == UsfmVersificationErrorType.ExtraVerse || Type == UsfmVersificationErrorType.InvalidChapterNumber || Type == UsfmVersificationErrorType.InvalidVerseNumber
I wonder if we could supply an expected ref. We should have some idea of what verse or chapter is expected next while parsing. If we can make a reasonable guess, I think we should add it for consistency with our attempts for other types of errors, but if we can't no worries. If the bad chapter number is something like \c 1^, that's pretty easy to find without an expected ref, but what about something like \c text: Is that possible? That would not be quite so easy to track down.
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 292 at r1 (raw file):
// See whether the chapter number is invalid VerseRef verseRef = state.VerseRef.Clone();
Why do we need to do this? Does the number get changed when the state is updated so we need the raw value itself? It would be nice if we could just pass the bad verse ref to the existing constructor and check for an error in CheckError() - same with invalid verses. If we have to do it, we have to do it, but it'd be nice to have the error checking all in one place.
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):
{ _errors.Add(versificationError); verseInError = true;
What's the particular duplicate error situation you're trying to avoid?
pmachapman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmachapman made 5 comments.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93).
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Rather than adding another constructor, is there way to just check the error type in
CheckError()itself?
I did that originally, but because we don't have an expected verse, and when I passed a verseRef it met the other conditions in CheckError(), it got confused with the other warnings. I figured another constructor made sense, as I also needed the actual string value to display to the user so they could resolve it. (If we don't show the actual string value to the user, the verse ref will only store "-1" for an invalid chapter, which isn't helpful to return in an error message)
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 94 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
I'm surprised that these kinds of errors (especially the bad verse numbers) aren't captured in the verse ref's valid status somehow 🤔. It is what it is, but it'd certainly be nice if it were haha.
_verseRef.Value.Valid will return false, but as it is not clear just what the cause is (i.e. there is no comparable enum to UsfmVersificationErrorType returned by VerseRef), I deal with the invalid chapter/verse error just above this logic.
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
I wonder if we could supply an expected ref. We should have some idea of what verse or chapter is expected next while parsing. If we can make a reasonable guess, I think we should add it for consistency with our attempts for other types of errors, but if we can't no worries. If the bad chapter number is something like
\c 1^, that's pretty easy to find without an expected ref, but what about something like\c text: Is that possible? That would not be quite so easy to track down.
I tried that, but it is hard to tell what it should have been. Chapters can be skipped in a USFM file. If we were to parse the string and guess, then we might as well not throw an error and use the parsed string?
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 292 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Why do we need to do this? Does the
numberget changed when the state is updated so we need the raw value itself? It would be nice if we could just pass the bad verse ref to the existing constructor and check for an error inCheckError()- same with invalid verses. If we have to do it, we have to do it, but it'd be nice to have the error checking all in one place.
I get the raw value and pass it into the error object so we can return it in the error message, so that the error message doesn't just say something like "invalid chapter number -1" but will say the actual value that caused the error. (the VerseRef struct does not record what the invalid chapter number value is).
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
What's the particular duplicate error situation you're trying to avoid?
A missing verse error
ddaspit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ddaspit made 1 comment.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @Enkidu93).
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Enkidu93 made 5 comments and resolved 4 discussions.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman).
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
I did that originally, but because we don't have an expected verse, and when I passed a verseRef it met the other conditions in
CheckError(), it got confused with the other warnings. I figured another constructor made sense, as I also needed the actual string value to display to the user so they could resolve it. (If we don't show the actual string value to the user, the verse ref will only store "-1" for an invalid chapter, which isn't helpful to return in an error message)
I see, and you couldn't easily add logic there to differentiate between the cases?
That's interesting. For verses, you can still access the bad verse value via vref.Verse (even though vref.VerseNum doesn't give you any clues), but not so with chapters? It's strange to me that they are handled differently.
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 94 at r1 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
_verseRef.Value.Validwill return false, but as it is not clear just what the cause is (i.e. there is no comparable enum toUsfmVersificationErrorTypereturned byVerseRef), I deal with the invalid chapter/verse error just above this logic.
I see. So even if you grab the verse's ValidStatus enum, it doesn't clue you in? If you're getting vref.Valid == false, then ValidStatus must be set to something other than ValidStatus.Valid - I wonder what it is 🤔
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
I tried that, but it is hard to tell what it should have been. Chapters can be skipped in a USFM file. If we were to parse the string and guess, then we might as well not throw an error and use the parsed string?
Well, that is true - that could be 2.0 😆. The USFM could skip chapters or verses, but if it is skipping, then that's the sort of thing we'd want to be catching anyways since this checking should all use the versification as the gold standard. I was just thinking that we could use some of the functions in ScrVers to see what the next expected reference is (e.g. something similar to what's done in ScrVers.FirstIncludedVerse() but only looking at subsequent verses). If it's impossible to have totally uninterpretable verses/chapters like \c text or \v word123word, then I'm fine with not having an expected ref, but if it is possible, I think we ought to include something like that since I believe we can make a reasonable guess and there would be no other way for the error to be actionable without running Ctrl+F in every USFM file.
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 292 at r1 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
I get the raw value and pass it into the error object so we can return it in the error message, so that the error message doesn't just say something like "invalid chapter number -1" but will say the actual value that caused the error. (the VerseRef struct does not record what the invalid chapter number value is).
I see. Makes sense! Like I said above, it'd odd to me that the bad chapter value can't be accessed like the verses can, but we'll make do!
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
A missing verse error
OK, yeah, I see. I was just wondering if it were worth allowing both errors to register since they're capturing different problems. And if a user got two warnings like Missing MAT 1:2 and Invalid verse MAT 1:schmutz, that may be helpful in locating/understanding the problem.
pmachapman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmachapman made 4 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93).
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):
I see, and you couldn't easily add logic there to differentiate between the cases?
I can if you think it is important to have just one constructor. I will need to pass the string value into that constructor, and update all uses of it?
For verses, you can still access the bad verse value via
vref.Verse(even thoughvref.VerseNumdoesn't give you any clues), but not so with chapters?
Correct. See https://github.com/sillsdev/libpalaso/blob/master/SIL.Scripture/VerseRef.cs#L235-L246
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 94 at r1 (raw file):
So even if you grab the verse's
ValidStatusenum, it doesn't clue you in?
Correct. These errors are only really thrown via the Trace.
ValidStatusmust be set to something other thanValidStatus.Valid- I wonder what it is 🤔
ValidStatus will be ValidStatusType.OutOfRange.
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
Well, that is true - that could be 2.0 😆. The USFM could skip chapters or verses, but if it is skipping, then that's the sort of thing we'd want to be catching anyways since this checking should all use the versification as the gold standard. I was just thinking that we could use some of the functions in
ScrVersto see what the next expected reference is (e.g. something similar to what's done inScrVers.FirstIncludedVerse()but only looking at subsequent verses). If it's impossible to have totally uninterpretable verses/chapters like\c textor\v word123word, then I'm fine with not having an expected ref, but if it is possible, I think we ought to include something like that since I believe we can make a reasonable guess and there would be no other way for the error to be actionable without running Ctrl+F in every USFM file.
Is a missing range error acceptable for this? These will be returned if the range that is invalid is where an expected range should be (see next comment).
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):
Previously, Enkidu93 (Eli C. Lowry) wrote…
OK, yeah, I see. I was just wondering if it were worth allowing both errors to register since they're capturing different problems. And if a user got two warnings like
Missing MAT 1:2andInvalid verse MAT 1:schmutz, that may be helpful in locating/understanding the problem.
A missing range error will often be returned if the invalid chapter of verse number is a noticeable gap (i.e. noticed by CheckError()), such as:
warnings: [
"Invalid chapter number error in project TEA at “MAN 1.” (parallel corpus 69893d8434467c056f3542e5, monolingual corpus 69893d8334467c056f3542e1)",
"USFM versification error in project TEA, expected verse “MAN 1:15”, actual verse “MAN -1:15”, mismatch type MissingChapter (parallel corpus 69893d8434467c056f3542e5, monolingual corpus 69893d8334467c056f3542e1)",
"Only 7 segments were selected for training."
]
``
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Enkidu93 made 3 comments.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @pmachapman).
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
I see, and you couldn't easily add logic there to differentiate between the cases?
I can if you think it is important to have just one constructor. I will need to pass the string value into that constructor, and update all uses of it?
For verses, you can still access the bad verse value via
vref.Verse(even thoughvref.VerseNumdoesn't give you any clues), but not so with chapters?Correct. See https://github.com/sillsdev/libpalaso/blob/master/SIL.Scripture/VerseRef.cs#L235-L246
It's not so much a matter of having one constructor as it is a matter of separation of concerns. If there's a way to keep all the error-checking in one place, that may be clearer and more maintainable, but it is fine as-is. We could always refactor later if we extend the functionality of this class further.
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 128 at r1 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
Is a missing range error acceptable for this? These will be returned if the range that is invalid is where an expected range should be (see next comment).
I think that works, yeah. As long as there's a way for the user to track down where the problem is.
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 332 at r1 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
A missing range error will often be returned if the invalid chapter of verse number is a noticeable gap (i.e. noticed by
CheckError()), such as:warnings: [ "Invalid chapter number error in project TEA at “MAN 1.” (parallel corpus 69893d8434467c056f3542e5, monolingual corpus 69893d8334467c056f3542e1)", "USFM versification error in project TEA, expected verse “MAN 1:15”, actual verse “MAN -1:15”, mismatch type MissingChapter (parallel corpus 69893d8434467c056f3542e5, monolingual corpus 69893d8334467c056f3542e1)", "Only 7 segments were selected for training." ] ``
I see. If it's a bad chapter, we'll also get a missing chapter error. And if it's a bad verse, we have the chapter and book, so unless it's Psalm 119, it shouldn't be too hard to find haha. Thank you, Peter!
pmachapman
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmachapman made 1 comment and resolved 1 discussion.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @pmachapman).
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):
If there's a way to keep all the error-checking in one place, that may be clearer and more maintainable, but it is fine as-is.
I spent the past hour trying to move the logic all in CheckErrors() but it got more and more convoluted with slight change breaking more and more tests :-/ Unfortunately, it looks like the logic will stay where it is.
I can only console myself with the fact is that the logic is in the same source file.
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Enkidu93 made 2 comments.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @pmachapman).
src/SIL.Machine/Corpora/UsfmVersificationErrorDetector.cs line 49 at r1 (raw file):
Previously, pmachapman (Peter Chapman) wrote…
If there's a way to keep all the error-checking in one place, that may be clearer and more maintainable, but it is fine as-is.
I spent the past hour trying to move the logic all in
CheckErrors()but it got more and more convoluted with slight change breaking more and more tests :-/ Unfortunately, it looks like the logic will stay where it is.I can only console myself with the fact is that the logic is in the same source file.
Thank you, Peter!
5ca0162 to
68784ea
Compare
Fixes: #372
Example USFM:
Example warnings:
This change is